@mattevans

  • introduce new CLI commands: `xcli lab xatu-cbt` and `xcli lab xatu-cbt generate-seed-data`
  • add AWS SDK v2 dependencies for S3/R2 upload
  • implement interactive and scripted modes for data extraction
  • support filtering by model, network, spec, range, and custom SQL
  • auto-generate xatu-cbt test YAML templates after extraction
  • upload parquet files to ethpandaops R2 bucket with overwrite check
  • document usage and S3 credential setup in README
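
The upload-with-overwrite-check step can be sketched as follows. This is a minimal illustration, not the code from this PR: `ObjectStore`, `memStore`, `ErrExists` and `UploadWithOverwriteCheck` are hypothetical names, and an in-memory map stands in for the R2 bucket so the example runs without the AWS SDK.

```go
package main

import (
	"errors"
	"fmt"
)

// ObjectStore is a hypothetical abstraction over an S3/R2 bucket.
type ObjectStore interface {
	Exists(key string) (bool, error)
	Put(key string, data []byte) error
}

// ErrExists signals that the key is already present and overwriting was not requested.
var ErrExists = errors.New("object already exists")

// UploadWithOverwriteCheck stores data under key unless the key already
// exists and overwrite is false.
func UploadWithOverwriteCheck(store ObjectStore, key string, data []byte, overwrite bool) error {
	exists, err := store.Exists(key)
	if err != nil {
		return fmt.Errorf("check %q: %w", key, err)
	}
	if exists && !overwrite {
		return fmt.Errorf("%q: %w", key, ErrExists)
	}
	return store.Put(key, data)
}

// memStore is an in-memory ObjectStore used only for illustration.
type memStore map[string][]byte

func (m memStore) Exists(key string) (bool, error) {
	_, ok := m[key]
	return ok, nil
}

func (m memStore) Put(key string, data []byte) error {
	m[key] = data
	return nil
}

func main() {
	store := memStore{}
	// The first upload succeeds; the second is refused because the key exists.
	fmt.Println(UploadWithOverwriteCheck(store, "example.parquet", []byte("a"), false))
	fmt.Println(UploadWithOverwriteCheck(store, "example.parquet", []byte("b"), false))
}
```

In a real client the existence check and the write are two separate requests, so a concurrent writer could still slip in between them; the sketch ignores that race.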

@mattevans mattevans self-assigned this Dec 10, 2025
mattevans and others added 16 commits December 11, 2025 10:56
Introduce a new CLI sub-command that automates the creation of
complete test YAML files for transformation models. The command
resolves the full dependency tree, queries external ClickHouse for
available data ranges, generates seed parquet files for every
external dependency, and optionally uploads them to S3. It also
supports AI-generated assertions via Claude.

- New command: `xcli lab xatu-cbt generate-transformation-test`
- Dependency tree resolution with cycle detection
- Range intersection across all external models
- Batch parquet generation and S3 upload
- AI assertion generation using Claude CLI
- Interactive and scripted modes
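
Dependency resolution with cycle detection, followed by range intersection, can be sketched roughly as below. This is an assumption-laden illustration rather than the PR's implementation: `ResolveExternal`, `Range` and `Intersect` are hypothetical names, the model names are made up, and "external" is approximated as "has no dependencies of its own".

```go
package main

import (
	"fmt"
	"sort"
)

// ResolveExternal walks the dependency graph depth-first from root and
// returns the sorted leaf models (treated here as the external ones).
// It returns an error if the graph contains a cycle reachable from root.
func ResolveExternal(deps map[string][]string, root string) ([]string, error) {
	const (
		visiting = 1 // node is on the current DFS path
		done     = 2 // node and its subtree are fully explored
	)
	state := map[string]int{}
	leaves := map[string]bool{}

	var visit func(node string) error
	visit = func(node string) error {
		switch state[node] {
		case visiting:
			return fmt.Errorf("dependency cycle detected at %q", node)
		case done:
			return nil
		}
		state[node] = visiting
		if len(deps[node]) == 0 {
			leaves[node] = true
		}
		for _, d := range deps[node] {
			if err := visit(d); err != nil {
				return err
			}
		}
		state[node] = done
		return nil
	}
	if err := visit(root); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(leaves))
	for l := range leaves {
		out = append(out, l)
	}
	sort.Strings(out)
	return out, nil
}

// Range is a closed interval over a numeric range column.
type Range struct{ Min, Max int64 }

// Intersect returns the overlap of all ranges; ok is false when there are
// no ranges or the overlap is empty.
func Intersect(ranges []Range) (r Range, ok bool) {
	if len(ranges) == 0 {
		return Range{}, false
	}
	r = ranges[0]
	for _, x := range ranges[1:] {
		if x.Min > r.Min {
			r.Min = x.Min
		}
		if x.Max < r.Max {
			r.Max = x.Max
		}
	}
	return r, r.Min <= r.Max
}

func main() {
	deps := map[string][]string{
		"fct_example": {"int_example", "ext_a"},
		"int_example": {"ext_a", "ext_b"},
	}
	ext, err := ResolveExternal(deps, "fct_example")
	fmt.Println(ext, err)
	r, ok := Intersect([]Range{{100, 500}, {200, 900}})
	fmt.Println(r, ok)
}
```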
…load

perf(seeddata): replace MIN/MAX with ORDER BY LIMIT 1 for faster range queries
refactor(seeddata): split range query into two single-value queries
chore(seeddata): reduce query timeout from 2m to 30s
Allow callers to force a specific range column instead of using
per-model detection. This enables consistent range queries across
models when needed.
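
The range-query changes described above (two single-value `ORDER BY ... LIMIT 1` queries, a 30s timeout, and an optional forced range column) might look like this in outline. `RangeQueries` and `QueryTimeout` are hypothetical names, the table and column in `main` are illustrative, and identifiers are interpolated unquoted on the assumption that they come from trusted model metadata.

```go
package main

import (
	"fmt"
	"time"
)

// QueryTimeout mirrors the reduced timeout mentioned in the commit title.
const QueryTimeout = 30 * time.Second

// RangeQueries returns two single-value queries: one for the lowest and one
// for the highest value of the range column. forcedColumn, when non-empty,
// overrides the per-model detected column.
func RangeQueries(table, detectedColumn, forcedColumn string) (minQuery, maxQuery string) {
	col := detectedColumn
	if forcedColumn != "" {
		col = forcedColumn
	}
	minQuery = fmt.Sprintf("SELECT %s FROM %s ORDER BY %s ASC LIMIT 1", col, table, col)
	maxQuery = fmt.Sprintf("SELECT %s FROM %s ORDER BY %s DESC LIMIT 1", col, table, col)
	return minQuery, maxQuery
}

func main() {
	minQ, maxQ := RangeQueries("example_table", "slot_start_date_time", "")
	fmt.Println(minQ)
	fmt.Println(maxQ)
	fmt.Println("timeout per query:", QueryTimeout)
}
```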
…nctions

Sanitize IPv4 and IPv6 columns during seed data generation by hashing them with a shared salt. This keeps anonymization consistent across related data sets while preserving the original IP address type (IPv4 vs IPv6, including IPv4-mapped IPv6). The change adds salt generation, ClickHouse schema introspection (`DESCRIBE TABLE`), and dynamic SQL query construction that replaces raw column selection with sanitization expressions in the `lab_xatu_cbt_generate_seed_data` and `lab_xatu_cbt_generate_transformation` commands.
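
The idea of salted hashing that preserves the address family can be illustrated in Go. Note the swap: the commit does this inside ClickHouse via generated SQL expressions, whereas the sketch below is a client-side analogue using SHA-256. `SanitizeIP` is a hypothetical name and the hash choice is an assumption.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net/netip"
)

// SanitizeIP replaces an address with one derived from a salted hash while
// keeping its kind: IPv4 stays IPv4, IPv4-mapped IPv6 stays IPv4-mapped,
// and any other IPv6 address stays IPv6.
func SanitizeIP(ip string, salt []byte) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", fmt.Errorf("parse %q: %w", ip, err)
	}
	// Unmap first so 1.2.3.4 and ::ffff:1.2.3.4 hash to the same IPv4 bytes.
	plain := addr.Unmap()
	input := append(append([]byte{}, salt...), plain.AsSlice()...)
	sum := sha256.Sum256(input)

	if plain.Is4() {
		var b4 [4]byte
		copy(b4[:], sum[:4])
		hashed := netip.AddrFrom4(b4)
		if addr.Is4In6() {
			// Re-wrap the hashed IPv4 address as ::ffff:a.b.c.d.
			return netip.AddrFrom16(hashed.As16()).String(), nil
		}
		return hashed.String(), nil
	}
	var b16 [16]byte
	copy(b16[:], sum[:16])
	return netip.AddrFrom16(b16).String(), nil
}

func main() {
	salt := []byte("example-salt") // a real salt would be randomly generated
	for _, ip := range []string{"192.0.2.1", "::ffff:192.0.2.1", "2001:db8::1"} {
		out, err := SanitizeIP(ip, salt)
		fmt.Println(ip, "->", out, err)
	}
}
```

Because the same salt is used for every column, an address that appears in two data sets maps to the same sanitized value in both, which keeps joins between them meaningful.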
… transformation

This change updates the seed data generation and transformation commands to report which IP columns were sanitized during the process, improving user visibility into data masking operations.
feat: Sanitize IPs for parquet file uploads
…est generation

This change introduces predefined time range presets (e.g., "Last 5 minutes", "Last 1 hour") to simplify selecting the time window for seed data generation during transformation tests. It also accounts for ingestion lag when calculating the effective maximum time.
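
A preset window with an ingestion-lag adjustment could be computed along these lines. The two preset labels come from the commit message; everything else (`Preset`, `Window`, `exampleMax`, and in particular the one-minute `assumedLag`) is a hypothetical placeholder, since the source does not state the actual lag value.

```go
package main

import (
	"fmt"
	"time"
)

// Preset pairs a human-readable label with a window length.
type Preset struct {
	Label    string
	Duration time.Duration
}

var presets = []Preset{
	{"Last 5 minutes", 5 * time.Minute},
	{"Last 1 hour", time.Hour},
}

// assumedLag is a placeholder value; the real buffer is not given in the source.
const assumedLag = 1 * time.Minute

// exampleMax stands in for the newest timestamp observed in the source table.
var exampleMax = time.Date(2025, 12, 11, 10, 0, 0, 0, time.UTC)

// Window returns the [start, end] window of length d whose end is the
// effective maximum time: the observed maximum minus the ingestion lag.
func Window(maxTime time.Time, lag, d time.Duration) (start, end time.Time) {
	end = maxTime.Add(-lag)
	return end.Add(-d), end
}

func main() {
	for _, p := range presets {
		start, end := Window(exampleMax, assumedLag, p.Duration)
		fmt.Printf("%s: %s to %s\n", p.Label, start.Format(time.RFC3339), end.Format(time.RFC3339))
	}
}
```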
…r-pt2

feat: add generate-transformation-test command for xatu-cbt
- Remove rangePresets, ingestionLagBuffer and manual range prompts
- Add --duration flag and new seeddata/discovery.go module
- Integrate Claude-based range strategy generation with fallback heuristic
- Validate data availability per model before generation
- Streamline UX: single duration prompt, AI summary, confirmation flow
…L filter analysis

- add support for entity/dimension tables (no time range) via intervalType
- read intermediate transformation SQL to extract WHERE clause filters
- extend discovery prompt to include correlation filters for dimension tables
- add FilterSQL and CorrelationFilter to TableRangeStrategy for precise filtering
- improve fallback discovery to handle entity models and missing ranges
- normalize YAML field names and fix unquoted datetime values in Claude responses
- extend QueryRowCount and GenerateOptions to accept additional SQL filters
- add S3 Cache-Control: no-cache header for fresh seed data downloads
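
How a row-count query might combine an optional time range with additional SQL filters is sketched below. `RowCountQuery` is a hypothetical helper, not the signature of `QueryRowCount` from this PR. An empty range column models an entity/dimension table with no time range, and the table, column, and filter values in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// RowCountQuery builds a ClickHouse count query. rangeColumn may be empty
// for entity/dimension tables, in which case no range condition is added.
// Extra filters are ANDed together, each wrapped in parentheses.
func RowCountQuery(table, rangeColumn, from, to string, filters ...string) string {
	var conds []string
	if rangeColumn != "" {
		conds = append(conds, fmt.Sprintf("%s BETWEEN '%s' AND '%s'", rangeColumn, from, to))
	}
	for _, f := range filters {
		if f = strings.TrimSpace(f); f != "" {
			conds = append(conds, "("+f+")")
		}
	}
	q := "SELECT count() FROM " + table
	if len(conds) > 0 {
		q += " WHERE " + strings.Join(conds, " AND ")
	}
	return q
}

func main() {
	// Time-ranged table with one extra filter.
	fmt.Println(RowCountQuery("example_events", "slot_start_date_time",
		"2025-12-11 10:00:00", "2025-12-11 10:05:00", "meta_network_name = 'mainnet'"))
	// Dimension table: no range, only a correlation-style filter.
	fmt.Println(RowCountQuery("example_entities", "", "", "", "id IN (1, 2, 3)"))
}
```

Values are interpolated directly into the SQL string for brevity; real code would want parameter binding or strict validation.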
feat: enhance seed-data discovery with dimension table support and SQL filter analysis
feat: replace manual range selection with AI-driven discovery
- add detection for common field name typos like "primaryrangeType"
- expand normalizeDiscoveryYAMLFields to handle PascalCase, snake_case
 and other variations Claude might output
- enhance discovery prompt with stricter formatting rules and examples
- include full YAML content in error messages for easier debugging
- log when YAML normalization actually changes field names
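
One way to normalize field-name variations is to fold every key to a case- and separator-insensitive form and look it up in a table of canonical names. The sketch below works on an already-decoded map rather than raw YAML text, and `NormalizeKeys`, `foldKey`, and the canonical spellings in `canonicalFields` are assumptions for illustration, not the behaviour of `normalizeDiscoveryYAMLFields`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// canonicalFields lists the spellings the decoder expects (hypothetical names).
var canonicalFields = []string{"primaryRangeType", "filterSql", "correlationFilter"}

// foldKey removes separators and lowercases, so PascalCase, snake_case,
// kebab-case and casing typos all collapse to the same lookup key.
func foldKey(k string) string {
	k = strings.NewReplacer("_", "", "-", "").Replace(k)
	return strings.ToLower(k)
}

// NormalizeKeys returns a copy of m with recognised keys renamed to their
// canonical spelling, plus a sorted list of the renames it performed so the
// caller can log when normalization actually changed something.
func NormalizeKeys(m map[string]any) (map[string]any, []string) {
	lookup := make(map[string]string, len(canonicalFields))
	for _, c := range canonicalFields {
		lookup[foldKey(c)] = c
	}
	out := make(map[string]any, len(m))
	var changed []string
	for k, v := range m {
		if c, ok := lookup[foldKey(k)]; ok && c != k {
			out[c] = v
			changed = append(changed, k+" -> "+c)
			continue
		}
		out[k] = v
	}
	sort.Strings(changed)
	return out, changed
}

func main() {
	in := map[string]any{
		"primaryrangeType":  "slot",
		"filter_sql":        "x = 1",
		"CorrelationFilter": "y = 2",
	}
	out, changed := NormalizeKeys(in)
	fmt.Println(out)
	fmt.Println(changed)
}
```

If the input contains both a canonical key and a variant of it, which value wins is not defined in this sketch; real code would need an explicit rule for that collision.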
@mattevans mattevans merged commit c880319 into master Dec 15, 2025
3 checks passed
@mattevans mattevans deleted the feat/xatu-cbt-parquet-exporter branch December 15, 2025 03:23
3 participants